Introduction

This project is part of the Udacity’s Data Analyst Nanodegree Program. I conducted an Exploratory Data Analysis (EDA) on one of the curated data sets, provided by Udacity, from Prosper, which is America’s first marketplace lending platform, with over $12 billion in funded loans as of June 2021.

This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information. The dataset contains loans created between 2005-Q4 and 2014-Q1 with a last updated date of 03/11/2014. Here’s a link to the variable definitions for this dataset.

The main purpose of this project is to summarize the characteristics of variables that can affect the loan status and to get some ideas about the relationships among multiple variables using summary statistics and data visualizations.

Initial Dataset Exploration

First, we load the Prosper Loan Data. Here are the observations and variable counts.

## [1] 113937     81

Here are the object names of the dataset.

##  [1] "ListingKey"                          "ListingNumber"                      
##  [3] "ListingCreationDate"                 "CreditGrade"                        
##  [5] "Term"                                "LoanStatus"                         
##  [7] "ClosedDate"                          "BorrowerAPR"                        
##  [9] "BorrowerRate"                        "LenderYield"                        
## [11] "EstimatedEffectiveYield"             "EstimatedLoss"                      
## [13] "EstimatedReturn"                     "ProsperRating..numeric."            
## [15] "ProsperRating..Alpha."               "ProsperScore"                       
## [17] "ListingCategory..numeric."           "BorrowerState"                      
## [19] "Occupation"                          "EmploymentStatus"                   
## [21] "EmploymentStatusDuration"            "IsBorrowerHomeowner"                
## [23] "CurrentlyInGroup"                    "GroupKey"                           
## [25] "DateCreditPulled"                    "CreditScoreRangeLower"              
## [27] "CreditScoreRangeUpper"               "FirstRecordedCreditLine"            
## [29] "CurrentCreditLines"                  "OpenCreditLines"                    
## [31] "TotalCreditLinespast7years"          "OpenRevolvingAccounts"              
## [33] "OpenRevolvingMonthlyPayment"         "InquiriesLast6Months"               
## [35] "TotalInquiries"                      "CurrentDelinquencies"               
## [37] "AmountDelinquent"                    "DelinquenciesLast7Years"            
## [39] "PublicRecordsLast10Years"            "PublicRecordsLast12Months"          
## [41] "RevolvingCreditBalance"              "BankcardUtilization"                
## [43] "AvailableBankcardCredit"             "TotalTrades"                        
## [45] "TradesNeverDelinquent..percentage."  "TradesOpenedLast6Months"            
## [47] "DebtToIncomeRatio"                   "IncomeRange"                        
## [49] "IncomeVerifiable"                    "StatedMonthlyIncome"                
## [51] "LoanKey"                             "TotalProsperLoans"                  
## [53] "TotalProsperPaymentsBilled"          "OnTimeProsperPayments"              
## [55] "ProsperPaymentsLessThanOneMonthLate" "ProsperPaymentsOneMonthPlusLate"    
## [57] "ProsperPrincipalBorrowed"            "ProsperPrincipalOutstanding"        
## [59] "ScorexChangeAtTimeOfListing"         "LoanCurrentDaysDelinquent"          
## [61] "LoanFirstDefaultedCycleNumber"       "LoanMonthsSinceOrigination"         
## [63] "LoanNumber"                          "LoanOriginalAmount"                 
## [65] "LoanOriginationDate"                 "LoanOriginationQuarter"             
## [67] "MemberKey"                           "MonthlyLoanPayment"                 
## [69] "LP_CustomerPayments"                 "LP_CustomerPrincipalPayments"       
## [71] "LP_InterestandFees"                  "LP_ServiceFees"                     
## [73] "LP_CollectionFees"                   "LP_GrossPrincipalLoss"              
## [75] "LP_NetPrincipalLoss"                 "LP_NonPrincipalRecoverypayments"    
## [77] "PercentFunded"                       "Recommendations"                    
## [79] "InvestmentFromFriendsCount"          "InvestmentFromFriendsAmount"        
## [81] "Investors"

Here’s the initial structure of the dataframe.

## 'data.frame':    113937 obs. of  81 variables:
##  $ ListingKey                         : chr  "1021339766868145413AB3B" "10273602499503308B223C1" "0EE9337825851032864889A" "0EF5356002482715299901A" ...
##  $ ListingNumber                      : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
##  $ ListingCreationDate                : chr  "2007-08-26 19:09:29.263000000" "2014-02-27 08:28:07.900000000" "2007-01-05 15:00:47.090000000" "2012-10-22 11:02:35.010000000" ...
##  $ CreditGrade                        : chr  "C" "" "HR" "" ...
##  $ Term                               : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus                         : chr  "Completed" "Current" "Completed" "Current" ...
##  $ ClosedDate                         : chr  "2009-08-14 00:00:00" "" "2009-12-17 00:00:00" "" ...
##  $ BorrowerAPR                        : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate                       : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield                        : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ EstimatedEffectiveYield            : num  NA 0.0796 NA 0.0849 0.1832 ...
##  $ EstimatedLoss                      : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ EstimatedReturn                    : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperRating..numeric.            : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperRating..Alpha.              : chr  "" "A" "" "A" ...
##  $ ProsperScore                       : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.          : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState                      : chr  "CO" "CO" "GA" "GA" ...
##  $ Occupation                         : chr  "Other" "Professional" "Other" "Skilled Labor" ...
##  $ EmploymentStatus                   : chr  "Self-employed" "Employed" "Not available" "Employed" ...
##  $ EmploymentStatusDuration           : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner                : chr  "True" "False" "False" "True" ...
##  $ CurrentlyInGroup                   : chr  "True" "False" "True" "False" ...
##  $ GroupKey                           : chr  "" "" "783C3371218786870A73D20" "" ...
##  $ DateCreditPulled                   : chr  "2007-08-26 18:41:46.780000000" "2014-02-27 08:28:14" "2007-01-02 14:09:10.060000000" "2012-10-22 11:02:32" ...
##  $ CreditScoreRangeLower              : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper              : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ FirstRecordedCreditLine            : chr  "2001-10-11 00:00:00" "1996-03-18 00:00:00" "2002-07-27 00:00:00" "1983-02-28 00:00:00" ...
##  $ CurrentCreditLines                 : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ OpenCreditLines                    : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ TotalCreditLinespast7years         : int  12 29 3 29 49 49 20 10 32 32 ...
##  $ OpenRevolvingAccounts              : int  1 13 0 7 6 13 6 5 12 12 ...
##  $ OpenRevolvingMonthlyPayment        : num  24 389 0 115 220 1410 214 101 219 219 ...
##  $ InquiriesLast6Months               : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ TotalInquiries                     : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ CurrentDelinquencies               : int  2 0 1 4 0 0 0 0 0 0 ...
##  $ AmountDelinquent                   : num  472 0 NA 10056 0 ...
##  $ DelinquenciesLast7Years            : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years           : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ PublicRecordsLast12Months          : int  0 0 NA 0 0 0 0 0 0 0 ...
##  $ RevolvingCreditBalance             : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization                : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit            : num  1500 10266 NA 30754 695 ...
##  $ TotalTrades                        : num  11 29 NA 26 39 47 16 10 29 29 ...
##  $ TradesNeverDelinquent..percentage. : num  0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
##  $ TradesOpenedLast6Months            : num  0 2 NA 0 2 0 0 0 1 1 ...
##  $ DebtToIncomeRatio                  : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange                        : chr  "$25,000-49,999" "$50,000-74,999" "Not displayed" "$25,000-49,999" ...
##  $ IncomeVerifiable                   : chr  "True" "True" "True" "True" ...
##  $ StatedMonthlyIncome                : num  3083 6125 2083 2875 9583 ...
##  $ LoanKey                            : chr  "E33A3400205839220442E84" "9E3B37071505919926B1D82" "6954337960046817851BCB2" "A0393664465886295619C51" ...
##  $ TotalProsperLoans                  : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ TotalProsperPaymentsBilled         : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ OnTimeProsperPayments              : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ ProsperPaymentsLessThanOneMonthLate: int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPaymentsOneMonthPlusLate    : int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPrincipalBorrowed           : num  NA NA NA NA 11000 NA NA NA NA NA ...
##  $ ProsperPrincipalOutstanding        : num  NA NA NA NA 9948 ...
##  $ ScorexChangeAtTimeOfListing        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanCurrentDaysDelinquent          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LoanFirstDefaultedCycleNumber      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanMonthsSinceOrigination         : int  78 0 86 16 6 3 11 10 3 3 ...
##  $ LoanNumber                         : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount                 : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate                : chr  "2007-09-12 00:00:00" "2014-03-03 00:00:00" "2007-01-17 00:00:00" "2012-11-01 00:00:00" ...
##  $ LoanOriginationQuarter             : chr  "Q3 2007" "Q1 2014" "Q1 2007" "Q4 2012" ...
##  $ MemberKey                          : chr  "1F3E3376408759268057EDA" "1D13370546739025387B2F4" "5F7033715035555618FA612" "9ADE356069835475068C6D2" ...
##  $ MonthlyLoanPayment                 : num  330 319 123 321 564 ...
##  $ LP_CustomerPayments                : num  11396 0 4187 5143 2820 ...
##  $ LP_CustomerPrincipalPayments       : num  9425 0 3001 4091 1563 ...
##  $ LP_InterestandFees                 : num  1971 0 1186 1052 1257 ...
##  $ LP_ServiceFees                     : num  -133.2 0 -24.2 -108 -60.3 ...
##  $ LP_CollectionFees                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_GrossPrincipalLoss              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NetPrincipalLoss                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NonPrincipalRecoverypayments    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PercentFunded                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Recommendations                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsCount         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsAmount        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Investors                          : int  258 1 41 158 20 1 1 1 1 1 ...

Next, I will use a subset of this dataset for performing an analysis. The main purpose of this analysis is to explore the various factors that affect borrowers’ rates. The following are the variables I would like to use.

##  [1] "LoanNumber"                "LoanOriginalAmount"       
##  [3] "LoanOriginationDate"       "LoanOriginationQuarter"   
##  [5] "Term"                      "LoanStatus"               
##  [7] "BorrowerRate"              "LenderYield"              
##  [9] "ProsperRating..Alpha."     "ProsperScore"             
## [11] "ListingCategory..numeric." "EmploymentStatus"         
## [13] "IsBorrowerHomeowner"       "CreditScoreRangeLower"    
## [15] "CreditScoreRangeUpper"     "InquiriesLast6Months"     
## [17] "PublicRecordsLast10Years"  "IncomeRange"              
## [19] "IncomeVerifiable"

Based on my observation of the variables in this dataset, there are data that are not in good form.

As a result, I’ll transform Term and ListingCategory columns from ‘numeric’ to ‘factor’, and transform ListingCreationDate column from ‘factor’ to ‘Date’. The following is the dataset’s structure transformed.

## 'data.frame':    113937 obs. of  19 variables:
##  $ LoanNumber              : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount      : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate     : chr  "2007-09-12 00:00:00" "2014-03-03 00:00:00" "2007-01-17 00:00:00" "2012-11-01 00:00:00" ...
##  $ LoanOriginationQuarter  : chr  "Q3 2007" "Q1 2014" "Q1 2007" "Q4 2012" ...
##  $ Term                    : Factor w/ 3 levels "1 year","3 years",..: 2 2 2 2 2 3 2 2 2 2 ...
##  $ LoanStatus              : chr  "Completed" "Current" "Completed" "Current" ...
##  $ BorrowerRate            : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield             : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ ProsperRating           : Factor w/ 8 levels "AA","A","B","C",..: NA 2 NA 2 5 3 6 4 1 1 ...
##  $ ProsperScore            : Factor w/ 11 levels "1","2","3","4",..: NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory         : Factor w/ 21 levels "NA","Debt Consolidation",..: 1 3 1 17 3 2 2 3 8 8 ...
##  $ EmploymentStatus        : chr  "Self-employed" "Employed" "Not available" "Employed" ...
##  $ IsBorrowerHomeowner     : chr  "True" "False" "False" "True" ...
##  $ CreditScoreRangeLower   : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper   : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ InquiriesLast6Months    : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ PublicRecordsLast10Years: int  0 1 0 0 0 0 0 1 0 0 ...
##  $ IncomeRange             : chr  "$25,000-49,999" "$50,000-74,999" "Not displayed" "$25,000-49,999" ...
##  $ IncomeVerifiable        : chr  "True" "True" "True" "True" ...

Univariate Plots Section

In this section, I will perform some preliminary exploration of the transformed dataset. I will run some summaries of the data and create univariate plots to understand the structure of the individual variables in this dataset.

Categorical Variables

First, let’s take a look on various categorical variables starting with Term.

## 
##     1 year    3 years    5 years 
## 0.01416572 0.77040821 0.21542607

Above 90% of the loans in this dataset are with term more than 1 year. Most of them is of 3 years.

## 
##              Cancelled             Chargedoff              Completed 
##           0.0000438839           0.1052511476           0.3341671274 
##                Current              Defaulted FinalPaymentInProgress 
##           0.4965551138           0.0440418828           0.0017992399 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##           0.0001404285           0.0070740848           0.0023258467 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##           0.0031859712           0.0027471322           0.0026681412

There are four major loan status: “Defaulted”, “Current”, “Completed”, and “Chargedoff.”

## 
##         AA          A          B          C          D          E         HR 
## 0.06330949 0.17148480 0.18362344 0.21619742 0.16822033 0.11543493 0.08172958 
##         NA 
## 0.00000000

The above bar chart shows Prosper’s rating distribution. The unrated loans are removed in this plot.

## 
##                    Employed     Full-time Not available  Not employed 
##   0.019791639   0.590870393   0.231312041   0.046929443   0.007328611 
##         Other     Part-time       Retired Self-employed 
##   0.033404425   0.009549137   0.006977540   0.053836769

The majority of Prosper users are either employed or have a full-time job.

About half of the users are homeowners.

## 
##             $0      $1-24,999      $100,000+ $25,000-49,999 $50,000-74,999 
##    0.005450380    0.063842299    0.152163037    0.282542107    0.272519024 
## $75,000-99,999  Not displayed   Not employed 
##    0.148468013    0.067941055    0.007074085

About 70% of borrowers’ income are in the range of $25,000 - $100,000

Most of borrowers have their income verified.

Numerical Variables

Next, let’s take a look on various numerical variables starting with BorrowerRate.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1340  0.1840  0.1928  0.2500  0.4975

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.0100  0.1242  0.1730  0.1827  0.2400  0.4925

The BorrowerRate has a right-skewed distribution, with a mean of 19.28%, and median of 18.40%. Similarly, the LenderYield also has a right-skewed distribution, with a mean of 18.27%, and median of 17.30%.

The ranges of credit scores are small compared to the distributions. I will use lower credit score for further analyses.

##  [1]   3   0   1   7   2   5   4  11   6   9  44  NA   8  12  19  42  27  10  15
## [20]  13  21  17  38  16  14  18  22  25  28  20  23  26  29  24  40  41  30  33
## [39]  31  32  34  35  36  63  97  37  46  52 105  53  50

The distribution of InquiriesLast6Months is right-skewed and long-tailed. Outliers like 105 and 53 exist.

##  [1]  0  1  2  5 NA  3  4  7  6 11  8 12  9 10 15 21 13 25 38 14 16 30 20 34 22
## [26] 17

The distribution of PublicRecordsLast10Years is right-skewed and has some outliers.

LoanOriginalAmount also displays a long-tailed distribution, with some numbers especially higher. The reason why these numbers are more frequent might be because they are the exact numbers (eg. 10,000, 15,000, 20,000) that people tend to choose.

Univariate Analysis

What is the structure of your dataset?

This is a very large dataset with 113937 loan records and 81 features on each loan. For simplicity with the analysis, I only use a subset of the origianl dataset. The subset has same number of records (113937), but only with 15 features that I want to explore. Those features (after transformation) are as follow:

Categorical: Term, LoanStatus, ProsperRating, ListingCategory, EmploymentStatus, IsBorrowerHomeowner, IncomeRange, IncomeVerifiable

Numerical: BorrowerRate, LenderYield, CreditScoreRangeLower, CreditScoreRangeUpper, InquiriesLast6Months, PublicRecordsLast10Years, LoanOriginalAmount

What is/are the main feature(s) of interest in your dataset?

The main features of interest in this dataset are BorrowerRate and ProsperRating, where ProsperRating might be the overall assessments of the borrowers performed by Prosper.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The features regarding borrowers’ status (EmploymentStatus, IsBorrowerHomeowner, etc.) and those that reveals borrowers’ past payment history (PublicRecords, Inquiries, etc.)

Did you create any new variables from existing variables in the dataset?

No, I didn’t create any new variables. I only convert some variables from numeric to factor. Thoses variables are either instrinsically categorical or they have only serveral possible values.

Of the features you investigated, were there any unusual distributions?

Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I observed some unusal distributions in the histogram of LoanOriginalAmount. The frequencies are higher around 4,000, 5,000, 10,000, 15,000 and so on. I think that’s because these numbers are just easy to pick for most borrowers.

The dataset is tidy and I did not perform other transformation other than changing their classes (ex. numeric -> factor )

Bivariate Plots Section

The correlation matrix shows that there are two strong positive relationships among the features. The first one is the trivial one: credit score upper range and lower range. And the second one is the borrower rate and the lender yield.

## [1] 0.9992113

The borrower rate and the lender yield have nearly perfect linear relationship. This is because Prosper charges a fixed rate on loans so the borrower rate and the lender yield should be strongly positively related.

This plot shows that Prosper ratings do affect borrowers’ rates.

Next, I want to inspect other factors that also have impact on the borrower rate and the Prosper rating. I’ll start with the borrower rate.

## $`1 year`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0400  0.0929  0.1434  0.1501  0.2064  0.2669 
## 
## $`3 years`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1274  0.1815  0.1935  0.2599  0.4975 
## 
## $`5 years`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0669  0.1490  0.1870  0.1930  0.2319  0.3304

The average rate of the 3-year’s is larger than the 1-year’s, but is very close to 5-year loan’s average. And 5-year loans have smaller range than 3-year loans.

## [1] -0.3289599

There exists a fairly weak negative relationship between the borrower rate and the original loan amount.

It seems that some categories have lower rates and others have higher rates.

Borrowers with their income verifiable or are homeowners seem to have lower rates, which makes sense.

Finally in this section, I want to what affect Prosper rating.

A higher credit score by the consumer rating agency also tends to have a higher Prosper rating.

For top ProsperRating groups like “AA” and “A”, Homeowners are the majority.

It seems very hard to get top rating when you don’t have your income verfied.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

First, the lender yield and borrower rate are strongly related because the fixed rate charged by Prosper. Second, the borrower rate and Prosper rating do vary with borrowers’ status.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The relationship between list category and the borrower rate is interesting. In average, cosmetic procedure’s rates are relatively higher, whereas boat’s rates are lower.

What was the strongest relationship you found?

The strongest relationship I found is the one between the borrower rate and the lender yield, with a correlation coefficient very close to 1.

Multivariate Plots Section

##    AA     A     B     C     D     E    HR    NA  NA's 
##  5372 14551 15581 18345 14274  9795  6935     0 29084

Tip: Now it’s time to put everything together. Based on what you found in the bivariate plots section, create a few multivariate plots to investigate more complex interactions between variables. Make sure that the plots that you create here are justified by the plots you explored in the previous section. If you plan on creating any mathematical models, this is the section where you will do that.

> Using jitter plot, we can visualize the real distribution of the borrrower rate relative to Prosper rating. In addition, I use different colors to indicate whether the individual borrower is a homeowner. This plot shows that most of the “AA” rated borrowers are also homeowners.

In this plot I found several interesting characteristics. First, each rating has its own colored area that is easily recognized, which indicates that Prosper rating is a good predictor for the borrower rate. Second, only highly rated loans (“AA”, “A”, “B”) can have original amount over $30,000, and most lowly rated loans (“E”, “HR”) are below 10,000.

## # A tibble: 6 x 3
## # Groups:   ProsperRating [1]
##   ProsperRating ListingCategory    BorrowerRate
##   <fct>         <fct>                     <dbl>
## 1 AA            NA                       0.0655
## 2 AA            Debt Consolidation       0.0800
## 3 AA            Home Improvement         0.0789
## 4 AA            Business                 0.0823
## 5 AA            Student Use              0.0821
## 6 AA            Auto                     0.0759

This plot shows the relative value of the averaged borrower rates of different list categories.

Compared to the previous plot, the boxplots give us more information about the outliers and range of the distribution. But it’s more difficult to compare mean values between different categories.

We can know from this plot that “Boat”, “Green loans”, “RV” categories have less data points. This explains why they have less outliers in the previous plots.

> The heat map shows that highly rated “Not employed” borrowers have to pay slight higher rates. The jitter plots shows that even “Not employed”, “Retired”, or “Part-time” borrowers can get loan with rates lower than 0.2.

## [1] "mean"
##    1 year   3 years   5 years 
## 0.1500807 0.1934855 0.1929907
## [1] "standard deviation"
##     1 year    3 years    5 years 
## 0.06785817 0.07925234 0.05566590

Most loans are either 3-year or 5-year loans, and they have higher borrower rates, and the main difference between 3-year loans and 5-year loans is deivation.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Were there any interesting or surprising interactions between features?

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Tip: You’ve done a lot of exploration and have built up an understanding of the structure of and relationships between the variables in your dataset. Here, you will select three plots from all of your previous exploration to present here as a summary of some of your most interesting findings. Make sure that you have refined your selected plots for good titling, axis labels (with units), and good aesthetic choices (e.g. color, transparency). After each plot, make sure you justify why you chose each plot by describing what it shows.

Plot One

Description One

The Prosper rating is an important factor for the borrower rate. High Prosper rating is linked to lower borrower rates.

Plot Two

Description Two

Only highly rated loans (“AA”, “A”, “B”) can have original amount over $30,000, and most lowly rated loans (“E”, “HR”) are below 10,000. Again Prosper rating is an great indicator of the borrower rate. The different colored areas can be easily recognized.

Plot Three

Description Three

The 3-year loans have higher borrower rates than the 1-year loans in each Prosper rating. The 5-year loans are more concentrated than the 3-year loans and have smaller total range.


Reflection

Tip: Here’s the final step! Reflect on the exploration you performed and the insights you found. What were some of the struggles that you went through? What went well? What was surprising? Make sure you include an insight into future work that could be done with the dataset.

Tip: Don’t forget to remove this, and the other Tip sections before saving your final work and knitting the final report!

The Prosper loan dataset contains up to 81 variables, which makes it hard to explore at the beginning. If I just randomly explore each variable in term, there would be no end and I probably would forget things along the way. To solve this problem, I decided to only explore the variables related to the borrower rates and the borrowers’ statuses. I subeted the dataset into a smaller one with only 15 variables.

In my subset dataset, there were only 7 numerical variables and my main interest was between the borrower rate and the various borrower statuses. Therefore, the majority of the visualizations I created included at least one categorical variable or more. I spent a lot of time researching how to create visualization with one, two, or more categorical variables. The solution is to use boxplots, grouped bar charts, jitter plots, and heat maps for the visualizations.

The result of this analysis shows that Prosper rating and term are two major indicators of the borrower rate. Other borrower statuses also have effects on borrower rate but their effects are either small or are also factors of Prosper rating. I think this analysis would be a good reference for anyone who wants to borrower money from a p2p lending platform.